{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Intro](#Intro)\n",
    "\t* [Clustering](#Clustering)\n",
    "\t* [Data Preprocessing](#Data-Preprocessing)\n",
    "\t* [Text Vectorization](#Text-Vectorization)\n",
    "\t\t* [Basic Vectorizer Methods](#Basic-Vectorizer-Methods)\n",
    "\t\t* [Words Embeddings [TODO]](#Words-Embeddings-[TODO])\n",
    "\t\t* [Similarity and Distance Metrics](#Similarity-and-Distance-Metrics)\n",
    "\t* [Dimensionality Reduction and Features Selection](#Dimensionality-Reduction-and-Features-Selection)\n",
    "\t* [Clustering](#Clustering)\n",
    "\t\t* [K-means](#K-means)\n",
    "\t\t\t* [Elbow Method](#Elbow-Method)\n",
    "\t* [Singular Value Decomposition (SVD)](#Singular-Value-Decomposition-%28SVD%29)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Intro"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is a work in progress regarding text clustering task. Overall, the libraries used are Numpy, NLTK, Seaborn, Sklearn and Gensim.\n",
    "\n",
    "Feel free to report any feedback by creating a issue on Github, or contacting me directly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clustering is about categorizing/organizing/labelling objects such as to maximize the similarity between objects in one cluster/group (inner class similarity) while maximazing the dissimilarity between different clusters (inter class similarity).\n",
    "\n",
    "Clustering is an example of an unsupervised learning algorithm.\n",
    "\n",
    "In the following cells I will explore clustering related to text/sentences. In such context similarity should target the semantic and pragmatic meaning of the text: sentences with the same or closely similar meaning should fall into the same category."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:42:54.944833",
     "start_time": "2017-08-31T16:42:54.929832"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import itertools \n",
    "import nltk\n",
    "import csv\n",
    "import scipy\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set_context(\"notebook\", font_scale=1.5)\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start with clean sentences as data to feed to the pipeline. \n",
    "Different preprocessing steps might be needed or can be used to possibly improve final results; these include:\n",
    "* HTML or other markup tags removal [BeautifulSoup]\n",
    "* Remove possible non relevant content (e.g. URLs, numbers, standard strings)\n",
    "* Remove stopwords\n",
    "* Lemmization and stemming [NLTK]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:37:15.662427",
     "start_time": "2017-08-31T16:37:15.637425"
    },
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sentences</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A brown fox jumped on the lazy dog</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A brown fox jumped on the brown duck</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A brown fox jumped on the lazy elephant</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>An elephant is eating green grass near the alpaca</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A green alpaca tried to jump over an elephant</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>May you rest in a deep and dreamless slumber</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           sentences\n",
       "0                 A brown fox jumped on the lazy dog\n",
       "1               A brown fox jumped on the brown duck\n",
       "2            A brown fox jumped on the lazy elephant\n",
       "3  An elephant is eating green grass near the alpaca\n",
       "4      A green alpaca tried to jump over an elephant\n",
       "5       May you rest in a deep and dreamless slumber"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Dummy example data\n",
    "vocabulary_size = 1000\n",
    "sentences = [\"A brown fox jumped on the lazy dog\", \n",
    "            \"A brown fox jumped on the brown duck\",\n",
    "            \"A brown fox jumped on the lazy elephant\",\n",
    "            \"An elephant is eating green grass near the alpaca\",\n",
    "            \"A green alpaca tried to jump over an elephant\",\n",
    "            \"May you rest in a deep and dreamless slumber\"]\n",
    "df = pd.DataFrame(sentences, columns=['sentences'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text Vectorization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to represent our text using a Vector Space Model, meaning that each sentence should be encoded as a continuous vector of numbers, such that semantic similarity between sentences are computable using mathematical distance metrics (e.g. Euclidean distance, Manhattan distance, cosine).\n",
    "\n",
    "Notice that for more complex techniques, your sentence can be encoded as a matrix (for example if each word is embedded as a vector), or more generally as a tensor (a N-dimensional vector)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Vectorizer Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Common ways to vectorize your sentences are based on words count. \n",
    "Each sentence is represented by a vector of length N, where N is the size of your vocabulary. Each element of the vector is then associated with a word (or N-gram), and has a value that depends on the technique used for the vectorization.\n",
    "* binarize\n",
    "* count\n",
    "* tf-idf (term frequency * inverse term frequency)\n",
    "* hashing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:37:20.699715",
     "start_time": "2017-08-31T16:37:20.695715"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:37:29.462216",
     "start_time": "2017-08-31T16:37:29.452215"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# This class accepts functions for preprocessing and tokenization, \n",
    "# so you can operate your data cleaning directly at this point\n",
    "vectorizer =TfidfVectorizer(analyzer=\"word\", max_features=vocabulary_size, stop_words=\"english\", ngram_range=(1,2))\n",
    "X = vectorizer.fit_transform(df['sentences'].values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:37:38.056708",
     "start_time": "2017-08-31T16:37:38.052707"
    },
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6, 37)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:37:40.674857",
     "start_time": "2017-08-31T16:37:40.666857"
    },
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['alpaca',\n",
       " 'alpaca tried',\n",
       " 'brown',\n",
       " 'brown duck',\n",
       " 'brown fox',\n",
       " 'deep',\n",
       " 'deep dreamless',\n",
       " 'dog',\n",
       " 'dreamless',\n",
       " 'dreamless slumber',\n",
       " 'duck',\n",
       " 'eating',\n",
       " 'eating green',\n",
       " 'elephant',\n",
       " 'elephant eating',\n",
       " 'fox',\n",
       " 'fox jumped',\n",
       " 'grass',\n",
       " 'grass near',\n",
       " 'green',\n",
       " 'green alpaca',\n",
       " 'green grass',\n",
       " 'jump',\n",
       " 'jump elephant',\n",
       " 'jumped',\n",
       " 'jumped brown',\n",
       " 'jumped lazy',\n",
       " 'lazy',\n",
       " 'lazy dog',\n",
       " 'lazy elephant',\n",
       " 'near',\n",
       " 'near alpaca',\n",
       " 'rest',\n",
       " 'rest deep',\n",
       " 'slumber',\n",
       " 'tried',\n",
       " 'tried jump']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.29315886,  0.35750457,  0.        ,  0.        ,  0.        ,\n",
       "         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "         0.        ,  0.        ,  0.        ,  0.24750486,  0.        ,\n",
       "         0.        ,  0.        ,  0.        ,  0.        ,  0.29315886,\n",
       "         0.35750457,  0.        ,  0.35750457,  0.35750457,  0.        ,\n",
       "         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "         0.35750457,  0.35750457]])"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X[4].toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Words Embeddings [TODO]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More advanced techniques might be able to capture a deeper semantic meaning of sentences\n",
    "\n",
    "* WordToVec\n",
    "* Doc2Vec\n",
    "* .."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Similarity and Distance Metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can test how similar/close sentences are, or directly rank all your data in one round. A common measure used for this case and in NLP tasks in general is cosine similarity: it consider the cosine of the angle between two vectors which is 1 for 0 degrees, lower otherwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A green alpaca tried to jump over an elephant\n",
      "May you rest in a deep and dreamless slumber\n"
     ]
    }
   ],
   "source": [
    "# Retrieve the sentence of our example dataset most similar to some test sentences\n",
    "test_sentences = [\"How to put an alpaca to sleep\", \"How to achieve a deep sleep\"]\n",
    "test_data = vectorizer.transform(test_sentences)\n",
    "for y in test_data:\n",
    "    res = cosine_similarity(y, X)\n",
    "    index = np.argsort(res[0])[-1]\n",
    "    print(sentences[index])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dimensionality Reduction and Features Selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Ref 1](http://www.kdnuggets.com/2015/05/7-methods-data-dimensionality-reduction.html)\n",
    "\n",
    "Reducing the dimensionality of your data can reduce the computation time for next steps in the pipeline, as well as improving your overall results. More data doesn't always imply better outcomes.\n",
    "\n",
    "An initial analysis can be operated considering factors that are indicative of a feature importance, these include:\n",
    "* missing values ratio (remove if percentage of missing values is higher than a predefined threshold)\n",
    "* low variance (remove if variance is lower than a predefined threshold). Given that variance is range dependent, normalized is required.\n",
    "\n",
    "A second step can rely on **correlation measures**, for which similar features are manually reduced to a single one.\n",
    "\n",
    "Using **machine learning models** to get insight about features importance is another good method that leverages the power already embedded in the models themselves. A good model is the Random Forest one, especially given how the results are easily interpretable. [??others]\n",
    "\n",
    "A different approach sees the measurements of results quality/performances using **different set of features**. This approach can be operated in two way: backward feature elimination (remove one at each step) and forward feature construction (add one at each step).\n",
    "\n",
    "A multitude of other techniques can be used, more complex and self-contained:\n",
    "* Principal Component Analysis (PCA) \n",
    "* Latent Dirichlet allocation (LDA)\n",
    "* Latent semantic analysis/indexing (LSA)\n",
    "* chi-squared\n",
    "* ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can simply treat our sentences as vectors and rely on the preferred clustering technique.\n",
    "As already mentioned, clustering is a type of unsupervised learning task.\n",
    "\n",
    "It is common for clustering algorithms to consist of two steps: initialization and refinement. The clustering algorithm used will start from the initial (possibly random) clustering state and iteratively refine our clusters, based on an a criterion function, which also defines the stopping criteria.\n",
    "\n",
    "Some possible clustering approaches are:\n",
    "* partitional (k-means)\n",
    "* hierarchical (agglomerative): build a tree representation of the data\n",
    "* density-based (DBSCAN)\n",
    "* spectral\n",
    "* mixtures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fast, partitional-type, noisy\n",
    "\n",
    "You specify how many clusters you want to generate, and the algorithm will iteratively process the data trying to define clusters that minimize inner-cluster difference and maximize the inter-cluster one.\n",
    "\n",
    "Given that the algorithm randomly initialize the clusters centroids, results can be non-optimal. To improve results it is common to run the algorithm a different number of times, and pick only the clustering which gives the best results (lowest value for the cost function). This process is already embedded in Sklearn implementation.\n",
    "\n",
    "There is no right number of clusters, the choice should be based on the context, requirements and possible visual/expert-knowledge judgment. A more theoretical approach consists in monitoring the cost function with different numbers of clusters, and spot the \"elbow\" of the cost curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n",
       "    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',\n",
       "    random_state=None, tol=0.0001, verbose=0)"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Specify number of clusters and fit the data\n",
    "num_clusters = 3\n",
    "km = KMeans(num_clusters=num_clusters)\n",
    "km.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sentences</th>\n",
       "      <th>Cluster</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A brown fox jumped on the lazy dog</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A brown fox jumped on the brown duck</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A brown fox jumped on the lazy elephant</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>An elephant is eating green grass near the alpaca</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A green alpaca tried to jump over an elephant</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>May you rest in a deep and dreamless slumber</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           sentences  Cluster\n",
       "0                 A brown fox jumped on the lazy dog        0\n",
       "1               A brown fox jumped on the brown duck        0\n",
       "2            A brown fox jumped on the lazy elephant        0\n",
       "3  An elephant is eating green grass near the alpaca        1\n",
       "4      A green alpaca tried to jump over an elephant        1\n",
       "5       May you rest in a deep and dreamless slumber        2"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Predict/retrieve the cluster ID of our data\n",
    "df['Cluster'] = km.predict(X)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Elbow Method"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method consist in visualizing how some clustering metrics evolve based on the number of clusters used.\n",
    "Metrics like percentage of variance explained, or inner-cluster mean distance can be plotted for a range of values. Where improvements show a clean slow down (elbow) it might be an indicator of a good choice for value to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from scipy.spatial.distance import cdist, pdist\n",
    "centroids = []\n",
    "dists = []\n",
    "data = X\n",
    "num_clusters = np.arange(1, 30)\n",
    "for k in num_clusters:\n",
    "    km = KMeans(n_clusters=k)\n",
    "    km.fit(data)\n",
    "    centroids.append(km.cluster_centers_)\n",
    "    dist = np.min(cdist(data, km.cluster_centers_, 'euclidean'), axis=1)\n",
    "    dist_avg = sum(dist)/len(data)\n",
    "    dists.append(dist_avg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sns.pointplot(x=num_clusters, y=dists)\n",
    "sns.plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Singular Value Decomposition (SVD)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SVD: \"factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor).\"\n",
    "\n",
    "[Ref_1](http://nbviewer.jupyter.org/github/fastai/numerical-linear-algebra/blob/master/nbs/2.%20Topic%20Modeling%20with%20NMF%20and%20SVD.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:40:50.627722",
     "start_time": "2017-08-31T16:40:50.622722"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X = X.todense() #get our data matrix in dense representation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:41:03.455456",
     "start_time": "2017-08-31T16:41:03.447455"
    },
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6, 37)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:46:11.385068",
     "start_time": "2017-08-31T16:46:11.377068"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "vocab = np.array(vectorizer.get_feature_names())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:44:01.165620",
     "start_time": "2017-08-31T16:44:01.146619"
    },
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(6, 6) (6,) (6, 37)\n"
     ]
    }
   ],
   "source": [
    "U, s, Vh = scipy.linalg.svd(X, full_matrices=False)\n",
    "print(U.shape, s.shape, Vh.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:46:25.847896",
     "start_time": "2017-08-31T16:46:25.832895"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "num_top_words=3\n",
    "\n",
    "def show_topics(a):\n",
    "    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]\n",
    "    topic_words = ([top_words(t) for t in a])\n",
    "    return [' '.join(t) for t in topic_words]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-31T16:46:30.178143",
     "start_time": "2017-08-31T16:46:30.164142"
    },
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['dreamless dreamless slumber slumber',\n",
       " 'brown fox brown fox',\n",
       " 'deep deep dreamless dreamless']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "show_topics(Vh[:3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda env:kaggle]",
   "language": "python",
   "name": "conda-env-kaggle-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "229px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
